Abstract

Finding statistically significant words in DNA and protein sequences forms the basis
for many genetic studies. By applying the maximal entropy principle, we give a systematic
way to study the nonrandom occurrence of words in DNA or protein sequences. Through
comparison with experimental results, it is shown that patterns of regulatory binding
sites in the Saccharomyces cerevisiae (yeast) genome tend to occur significantly in the promoter
regions. We studied two coregulated gene families of yeast. The method successfully extracts
the binding sites verified by experiments in each family. Many putative regulatory sites in
the upstream regions are proposed. The study also suggests that some regulatory sites are
active in both directions, while others show directional preference.

1 Introduction. 

It is intriguing, but not unexpected, that DNA and protein sequences deviate remarkably from
random sequences [?]. According to information theory, random sequences carry minimal
information (maximal entropy), while the total information of life is assumed to be contained
in DNA and protein sequences. As a result, investigation of the non-randomness of DNA and
amino acid sequences is a natural focus of bioinformatics.

Finding the nonrandom occurrence of words (short strings) in non-coding DNA sequences
is of interest because a large portion of the regulatory elements of eukaryotes are words of
limited length in the non-coding sequences (for example, about 10 bases, of which the core part
is about 5 bases [?]). Subjected to functional constraints, the patterns of regulatory elements
are expected to deviate from random occurrence.

In this paper, by applying the Maximal Entropy Principle (MEP), we develop a way to
investigate the nonrandom occurrence of words in DNA sequences. Each word is given a
significance index which quantifies the nonrandomness of its occurrence. The method is
then applied to study the promoter regions of Saccharomyces cerevisiae (yeast) [?]. We compare
the theoretical results with experiments in two ways. In the first, the promoter database of
yeast (SCPD) [?] is analysed. It is found that, statistically, overrepresented words are more
easily encountered in the database. In the second, the promoters of coregulated gene families
are studied. The experimentally found binding sites are successfully extracted, and more
putative binding sites are suggested.

In the following section the method is developed in detail, and in the third section it is
applied to the promoter regions of yeast.

2 Treating the nonrandomness of DNA sequences via the Maximal Entropy Principle.

The idea comes from a simple observation. Take a long DNA sequence as an example. Given
only the (normalized) frequencies of A, C, G, T ($P_A, P_C, P_G, P_T$), one would expect the
frequencies of 2-tuples to have the form

$$ P^0_{c_1c_2} = P_{c_1} \times P_{c_2}. \tag{1} $$

Here $c_1$ and $c_2$ are each one of the four bases A, C, G, T.

Comparison between the measured frequency $P_{c_1c_2}$ and the expected value $P^0_{c_1c_2}$
reveals the statistical significance of $c_1c_2$ in the sequence.
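
As a minimal sketch of this comparison, the following Python fragment counts overlapping
1-tuples and 2-tuples in a toy sequence (invented here purely for illustration) and forms the
expected 2-tuple frequencies of eq. (1). The helper name `kmer_frequencies` is hypothetical, and
the ratio of measured to expected frequency is used only as a simple indicator.

```python
from collections import Counter

def kmer_frequencies(seq, k):
    """Normalized frequencies of all overlapping k-tuples in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

seq = "ATATATGCGCATATTA"  # toy sequence, not real data
p1 = kmer_frequencies(seq, 1)  # base frequencies
p2 = kmer_frequencies(seq, 2)  # measured 2-tuple frequencies

# Eq. (1): expected 2-tuple frequency from base frequencies alone.
expected = {a + b: p1[a] * p1[b] for a in p1 for b in p1}

# A ratio above 1 means the 2-tuple occurs more often than expected.
ratio_AT = p2.get("AT", 0.0) / expected["AT"]
```

In this toy sequence AT is overrepresented relative to the independent-base expectation.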

To generalize the above idea, one faces the problem of predicting the frequencies of
$(k+1)$-tuples from the frequencies of $k$-tuples when $k > 1$. A reasonable definition can
then be used to evaluate the statistical significance of words longer than two bases.

The following is an attempt to answer this problem. In the treatment, when the composition
of a $k$-tuple is concerned, the word will be written as $c_1c_2\cdots c_{k-1}c_k$. However,
when only the length $k$ of the word is relevant, it will be written as $w^k$. A combined form
may also be used. For example, $w^kc$ ($cw^k$) is the word obtained by adding a letter $c$ to
the right (left) of $w^k$. The measured and expected frequencies of $w^k$ in the sequence will
be written as $P_{w^k}$ and $P^0_{w^k}$, respectively.



There are a total of $4^k$ $k$-tuples. For prediction, the Maximal Entropy Principle (MEP)
is a preferred choice. According to modern genetics, the driving force for nucleotide sequence
evolution is, on the one hand, random mutations of bases that maximize the entropy, and, on
the other hand, natural selection, which subjects the maximization of entropy to certain
constraints. Therefore, DNA sequence analysis shows an intrinsic connection to the MEP. A
brief introduction to the MEP, as needed for our use, is given below. More details
can be found in, e.g., [?].

Suppose that $\{P_i, i = 0, 1, 2, \cdots\}$ is a discrete distribution. An information entropy
can be defined on it [?]:

$$ S = \sum_i P_i \ln P_i. \tag{2} $$

Usually $\{P_i\}$ satisfies some constraints:

$$ F_j(\{P_i\}) = 0, \quad j = 1, 2, \ldots, M. \tag{3} $$

Here $M$ is the number of constraints. Define a target function:

$$ H = S + \sum_{j=1}^{M} \lambda_j F_j(\{P_i\}), \tag{4} $$

$\lambda_j$ being Lagrange multipliers. The MEP states that the distribution minimizing the
target function $H$ is the most reasonable distribution satisfying constraints (3). With the
sign convention of eq. (2), $S$ is the negative of the Shannon entropy, so minimizing $H$
corresponds to maximizing the entropy. This, however, does not state that $\{P_i\}$ is the only
distribution satisfying (3).

The MEP can now be applied to the problem raised above. The entropy function here is:

$$ S = \sum_i P^0_{w^{k+1}(i)} \ln P^0_{w^{k+1}(i)}, $$

where $i$ is an index used to distinguish tuples of a given length from each other. (To obtain
the index of a word, the following map is used: A to 0, C to 1, G to 2, and T to 3. The word is
thus mapped to a string containing only 0, 1, 2 and 3, which is then read as a quaternary
number. After conversion to decimal, the number is used as the index of the word.)
The constraints in the present problem are:

$$ P_{w^k(i)} = \sum_c P^0_{w^k(i)c}, \qquad P_{w^k(i)} = \sum_c P^0_{cw^k(i)}, \qquad i = 0, 1, 2, \cdots, 4^k - 1. \tag{5} $$

$P^0_{w^{k+1}}$ is the frequency to be predicted and $P_{w^k}$ is the frequency already known.
There is a total of $2 \times 4^k$ constraints. It is possible that these constraints are
linearly dependent, so that the number of effective constraints is smaller than $2 \times 4^k$.
This, however, does not alter the result.
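
The word-indexing rule described above (A to 0, C to 1, G to 2, T to 3, read as a quaternary
number) can be sketched in Python as follows. The function names are hypothetical and chosen
only for this illustration.

```python
BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}
DIGIT_TO_BASE = "ACGT"

def word_to_index(word):
    """Read the word as a base-4 number and return its decimal value."""
    index = 0
    for base in word:
        index = 4 * index + BASE_TO_DIGIT[base]
    return index

def index_to_word(index, k):
    """Inverse map: recover the k-tuple that has the given index."""
    bases = []
    for _ in range(k):
        index, digit = divmod(index, 4)
        bases.append(DIGIT_TO_BASE[digit])
    return "".join(reversed(bases))
```

For a fixed length $k$ the indexes run from 0 to $4^k - 1$, which matches the range of $i$ in
eq. (5).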

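The solution can be obtained:

$$ P^0_{c_1c_2\cdots c_{k+1}} = \frac{P_{c_1c_2\cdots c_k} \times P_{c_2c_3\cdots c_{k+1}}}{P_{c_2c_3\cdots c_k}}. \tag{6} $$

When $k = 1$, the denominator is the frequency of the empty word, which equals 1, and the
solution reduces to the intuitive result, eq. (1).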
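
A small numerical sketch may help to see that eq. (6) satisfies the constraints of eq. (5).
The code below is an illustration, not the implementation used for the results in this paper.
It assumes circular counting of the toy sequence, a choice made here so that the $k$-tuple and
$(k-1)$-tuple frequencies are exactly consistent; the function names and the sequence are
hypothetical.

```python
from collections import Counter
from itertools import product

def circular_kmer_frequencies(seq, k):
    """Normalized k-tuple frequencies, treating seq as circular (an assumption)."""
    extended = seq + seq[:k - 1]
    counts = Counter(extended[i:i + k] for i in range(len(seq)))
    return {word: n / len(seq) for word, n in counts.items()}

def predict_next_order(seq, k):
    """Expected (k+1)-tuple frequencies from measured k-tuple frequencies, eq. (6)."""
    p_k = circular_kmer_frequencies(seq, k)
    # For k = 1 the denominator is the frequency of the empty word, i.e. 1.
    p_km1 = circular_kmer_frequencies(seq, k - 1) if k > 1 else {"": 1.0}
    expected = {}
    for letters in product("ACGT", repeat=k + 1):
        word = "".join(letters)
        denom = p_km1.get(word[1:-1], 0.0)
        numer = p_k.get(word[:-1], 0.0) * p_k.get(word[1:], 0.0)
        expected[word] = numer / denom if denom > 0 else 0.0
    return p_k, expected

seq = "ATGCGATATATGCCGATTAGC"  # toy sequence, not real data
p_k, expected = predict_next_order(seq, 2)

# Constraint check, eq. (5): summing the predicted 3-tuple frequencies over the
# added letter should give back the measured 2-tuple frequency.
right_marginal_TA = sum(expected["TA" + c] for c in "ACGT")
left_marginal_TA = sum(expected[c + "TA"] for c in "ACGT")
```

Both marginals agree with the measured frequency of TA up to rounding, as eq. (5) requires.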

The above treatment goes from $k$-tuples to $(k+1)$-tuples. As a generic scheme, the MEP can
also be applied to predict the frequencies of $(k+2)$-tuples, $(k+3)$-tuples and so on, based on
the frequencies of $k$-tuples. In fact, one can obtain the result by repeatedly applying
eq. (6). For example:

$$
\begin{aligned}
P^0_{c_1c_2\cdots c_{k+1}c_{k+2}} &= \frac{P^0_{c_1c_2\cdots c_{k+1}} \times P^0_{c_2c_3\cdots c_{k+2}}}{P_{c_2c_3\cdots c_{k+1}}} \\
&= \frac{P_{c_1c_2\cdots c_k} \times P_{c_2c_3\cdots c_{k+1}} \times P_{c_3c_4\cdots c_{k+2}}}{P_{c_2c_3\cdots c_k} \times P_{c_3c_4\cdots c_{k+1}}}.
\end{aligned}
\tag{7}
$$

Thus, when one refers to the expected frequency of a certain word of length $k$, the tuple
length on which the prediction is based must be stated.
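
The repeated application of eq. (6) can be written as a product over the $k$-windows of a word
divided by the $(k-1)$-overlaps between consecutive windows. The following Python sketch
illustrates this under the same circular-counting assumption as before; the function names and
the toy sequence are hypothetical and serve only as an example.

```python
from collections import Counter
from itertools import product

def circular_kmer_frequencies(seq, k):
    """Normalized k-tuple frequencies, treating seq as circular (an assumption)."""
    extended = seq + seq[:k - 1]
    counts = Counter(extended[i:i + k] for i in range(len(seq)))
    return {word: n / len(seq) for word, n in counts.items()}

def expected_frequency(word, seq, k):
    """Expected frequency of `word` predicted from the k-tuple frequencies of seq."""
    if not 1 <= k < len(word):
        raise ValueError("k must satisfy 1 <= k < len(word)")
    p_k = circular_kmer_frequencies(seq, k)
    p_km1 = circular_kmer_frequencies(seq, k - 1) if k > 1 else {"": 1.0}
    value = p_k.get(word[:k], 0.0)
    for j in range(1, len(word) - k + 1):
        overlap = p_km1.get(word[j:j + k - 1], 0.0)
        if overlap == 0.0:
            return 0.0  # the word contains a (k-1)-tuple never seen in seq
        value *= p_k.get(word[j:j + k], 0.0) / overlap
    return value

seq = "TATATGCGATATTACGTATA"  # toy sequence, not real data

# The same 5-tuple has a different expected frequency for each base length k.
predictions = {k: expected_frequency("TATAT", seq, k) for k in range(1, 5)}

# The predicted 4-tuple frequencies based on 2-tuples sum to 1 up to rounding.
total = sum(expected_frequency("".join(w), seq, 2) for w in product("ACGT", repeat=4))
```

The dictionary `predictions` makes the point of the paragraph above concrete: an expected
frequency is only meaningful together with the tuple length it was predicted from.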

With the frequencies of longer words, one can always obtain the frequencies of shorter ones.
On the other hand, the expected frequencies of longer words, eq. (7), are predicted from the
frequencies of shorter words, with no further information added. Therefore, the deviation of
the measured frequencies from the expected ones carries the new information that emerges only
in the frequencies of the longer words. In order to use this information, we refer to the
following significance index

$$ I_{w^k} = \frac{P_{w^k} - P^0_{w^k}}{\cdots} $$

The indexes of the $k$-tuples form a vector of dimension $4^k$.

It should be pointed out that the simple solution, eq. (6), results from the constraints,
eq. (5). Although there are many ways to write down the prediction [7, 8], the Maximal Entropy
Principle ensures that, subject to these constraints, the solution eq. (6) is the best one.
However, one can consider more constraints. Apart from contiguous words, spaced patterns can
also be included in the above statistical treatment [?]. As an example, consider the spaced
word $c_1$-$c_2$, where $c_1$ and $c_2$ are given bases and the base between them is not
relevant. One more constraint

$$ P_{c_1\text{-}c_2} = \sum_c P^0_{c_1cc_2} $$

can be added to the frequencies of 3-tuples, and the statistical significance of the spaced
words can also be evaluated. The MEP, as a general framework, is still applicable, but there
will be no simple explicit solution such as eq. (6).
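
For illustration, the measured spaced-word frequency that enters the left-hand side of such a
constraint can be obtained by summing measured 3-tuple frequencies over the middle base. The
sketch below uses hypothetical helper names and a toy sequence, and it does not attempt to
solve the constrained MEP problem itself.

```python
from collections import Counter

def triplet_frequencies(seq):
    """Normalized frequencies of overlapping 3-tuples in seq."""
    n = len(seq) - 2
    counts = Counter(seq[i:i + 3] for i in range(n))
    return {word: c / n for word, c in counts.items()}

def spaced_frequency(seq, first, second):
    """Frequency of the spaced word first-N-second, with N any base."""
    p3 = triplet_frequencies(seq)
    return sum(p3.get(first + middle + second, 0.0) for middle in "ACGT")

seq = "ACGATGAAG"  # toy sequence, not real data
p_a_g = spaced_frequency(seq, "A", "G")  # windows ACG, ATG and AAG match
```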

3 The relationship between regulatory elements and statistically significant words in the
yeast promoter regions.

With the accumulation of a huge amount of genome sequence, analysis of the regulatory regions
becomes urgent, because they govern the regulation of gene expression. Finding the regulatory
sites in eukaryotic genomes is especially difficult, largely because of their strong
variability. This, however, gives statistical methods the chance to play an important role in
binding site prediction.

The regulatory elements are functionally constrained and are often shared by many genes.
As a result, the sites are expected to be significantly represented. On this basis, the
method developed above is expected to be applicable to finding regulatory sites in the promoter
regions of yeast. We employ two ways to check this point.

In the first way, simply as an illustration of the effectiveness of the MEP treatment, a data
set including all the promoters of yeast is used to perform the statistical evaluation. The
promoter regions refer, following Zhang [?], to the upstream regions 500 bases long. From
the sequence set the word frequencies are obtained and $I_{w^k}$, $k = 2, \cdots, 8$, are
calculated according to eq. (6) and the significance index defined above. (To obtain
$I_{w^k}$, $P^0_{w^k}$ is predicted from the frequencies of $(k-1)$-tuples.) For comparison,
the indexes $I_{w^k}$, $k = 2, \cdots, 8$, of words in the coding regions (CDSs) of yeast were
also calculated.

To compare the significance index of words with experimentally verified regulatory elements,
a purely statistical approach was taken. The promoter database of yeast collected by
Zhou et al. [?] was used as the set of targets. A word is said to hit a target if it covers a
known regulatory element or part of the element. In this way, each word is checked against
all the elements in the database. We want to see whether the total hits of words correlate
with the significance index.
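
The hit counting can be sketched as follows. The exact matching rule is not spelled out beyond
the sentence above, so this example assumes that a word hits an element when the word occurs
inside the element or the element occurs inside the word. The element list, the index values
and the function names are hypothetical and are not taken from the database.

```python
def count_hits(word, elements):
    """Number of elements hit by the word (substring match in either direction)."""
    return sum(1 for element in elements if word in element or element in word)

def hit_ratio(index_by_word, hits_by_word, cutoff):
    """Average hits of words with index above cutoff, relative to all words."""
    all_hits = [hits_by_word[w] for w in index_by_word]
    top_hits = [hits_by_word[w] for w, idx in index_by_word.items() if idx > cutoff]
    if not top_hits or sum(all_hits) == 0:
        return float("nan")
    return (sum(top_hits) / len(top_hits)) / (sum(all_hits) / len(all_hits))

# Toy targets, chosen only to exercise the functions.
elements = ["TGACTC", "CCCCAC"]
hits_gact = count_hits("GACT", elements)

# Made-up significance indexes and hit counts for four 2-tuples.
index_by_word = {"AA": 5.5, "AC": 0.1, "AG": -1.0, "AT": 3.2}
hits_by_word = {"AA": 4, "AC": 1, "AG": 1, "AT": 2}
ratio = hit_ratio(index_by_word, hits_by_word, cutoff=3.0)
```

A ratio above 1 means that words above the cutoff hit known elements more often than an
average word of the same length.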

Fig. 1 shows the ratio of the average hits of words whose significance index is larger than
a certain cutoff (5.0, 3.0, or 2.0) to the average hits of all the $k$-tuples. Some properties
of the significance index in the promoter regions are revealed. First, for all the cutoff values
shown in Fig. 1, the ratios are always larger than 1. Second, when the words are longer than
4 bases, the average hits increase with the cutoff. Furthermore, the ratio also increases with
word length. By comparison, Fig. 1 shows that the ratio of hits does not depend on the
significance index in the CDS regions.

To examine further the dependence of hits on the significance index, words are divided into
groups according to their significance index values, and the hits are averaged within each
group. See Table 1, and Fig. 2, which is based on the data in Table 1 but gives a more visual
illustration. The dependence of hits on the significance index shown in Fig. 1 is seen again.
Furthermore, the average hits are not a monotonic function of the significance index in the
promoter regions. For words with either a large positive or a large negative significance index
in the promoter regions, the average hits are larger than those of words whose significance
index is around zero. Again, no dependence of the average hits on the significance index in the
CDS regions is observed in Table 1 and Fig. 2.

That words with a large negative significance index in the promoter regions also show higher
affinity to binding sites deserves more consideration. One account is that although some
regulatory elements, such as those involved in the expression of housekeeping genes, are
expected to be overrepresented since these genes are needed in large amounts, others that
control the expression of essential but sparingly needed genes are expected to be
underrepresented to avoid inappropriate expression. However, a more convincing explanation
exists: if a word, e.g., $wA$, has a high positive index, then some of $wC$, $wG$, $wT$ are
expected to have negative indexes. This can be seen from the following example. While the index
of TATAT is 16.3, that of TATAA is -12.2. In fact, both have very high counts in the sequences
and both are variants of the binding site of the same transcription factor.
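
The compensation between $wA$ and the other extensions of $w$ follows from the constraints of
eq. (5). As a short sketch, assume that end effects are negligible, so that the measured
frequencies also satisfy $\sum_c P_{wc} = P_w$, and that the significance index has the sign of
$P_{wc} - P^0_{wc}$:

```latex
\sum_{c \in \{A, C, G, T\}} \left( P_{wc} - P^0_{wc} \right)
  = \sum_c P_{wc} - \sum_c P^0_{wc}
  = P_w - P_w
  = 0 .
```

The deviations of the four one-letter extensions of $w$ therefore cancel, so a large positive
deviation for $wA$ must be balanced by negative deviations among $wC$, $wG$ and $wT$. TATAT and
TATAA, which share $w$ = TATA, are an instance of this.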

For universally present regulatory elements, as expected, the significance indexes in the
promoter regions are very high. One example is the poly(A/T) stretches. As given above, the
significance index of TATAT is 16.3. The significance index of TATATAT, 8.1, is also high. As
another example, the significance index of the core of the CAAT-box, CAAT, is 8.95. However,
in order to develop an algorithm for the prediction of regulatory elements, more subtle
considerations must be involved. First, genes need to be classified into families to improve
the compositional bias of the sequences. Furthermore, a more elaborate use of the information
given by the significance index should be considered because, according to eq. (7), the expected
frequency of $k$-mers can be defined in $k-1$ ways, i.e., based on the frequencies of
$1, 2, \cdots, (k-1)$-mers, respectively. For each definition a significance index can be
obtained. Considering the statistical properties of words in the sequences, each of these
indexes gives useful information. We choose two coregulated gene families to further test our
method.

The coregulated genes of yeast metabolism have been widely studied, and these data sets
provide ideal material to test methods for binding site prediction. Two families of coregulated
genes, GCN and TUP, are shown in Table 2. Detailed information on them can be found in [?].
For each family, the frequencies of 6-tuples in the promoter regions were first collected.
Their expected frequencies were predicted in five ways, based on the frequencies of
bases, 2-tuples, 3-tuples, 4-tuples, and 5-tuples, respectively. Instead of $I_{w^k}$, a simpler
significance index, $P_{w^k}/P^0_{w^k}$, was used. In our study only a single strand of the
promoter sequences is considered. This differs from Ref. [?], where the occurrences of each
word are counted on both strands. In that way there are only 2080 distinct oligonucleotides,
while the number in ours is $4^6 = 4096$. Table 3 shows the words for which at least 3 of the
5 significance indexes are larger than 3. There are 13 such words for the GCN family, and 23
for the TUP family. In Table 3 several words tend to cluster together to form a longer pattern.
Generally speaking, the clusters can be expanded by including words with slightly lower
significance.
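
The selection rule for Table 3 (at least 3 of the 5 ratios larger than 3) can be sketched as
follows. The function name is hypothetical. The two example rows use values quoted from Table 3,
and the third row is made up to show a word that fails the criterion.

```python
def select_words(sig_by_word, cutoff=3.0, min_above=3):
    """Words for which at least `min_above` significance values exceed `cutoff`."""
    return [
        word
        for word, sigs in sig_by_word.items()
        if sum(s > cutoff for s in sigs) >= min_above
    ]

sig_by_word = {
    "GTGGGG": [6.67, 5.23, 3.76, 3.27, 1.47],   # sig(1)..sig(5), from Table 3
    "CGCGAT": [2.76, 3.48, 4.12, 3.68, 2.083],  # sig(1)..sig(5), from Table 3
    "AAAAAA": [1.20, 1.10, 1.00, 0.90, 1.00],   # made-up values, not from the paper
}
selected = select_words(sig_by_word)
relaxed = select_words(sig_by_word, cutoff=1.3)  # the lower cutoff used for Gcn4p
```

Lowering `cutoff` admits more words and lets clusters grow, as described above.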

In both families, 6-tuples corresponding to regulatory binding sites found by experimental
analysis are observed in Table 3. See the first cluster of words for the GCN family and the
first and second clusters for the TUP family. Most of these words also show high statistical
significance in the analysis of Ref. [?]. Some words predicted by Ref. [?] but not verified by
experiments are also observed in Table 3 (significant words shared by Ref. [?] and the present
analysis are shown in bold in Table 3). However, our analysis also found many significant words
which do not have such highly significant scores according to Ref. [?].

Two clusters of words for the TUP family are noteworthy (see the first and second clusters
in Table 3). The first cluster includes GTGGGG, AGGGGC, ACGGGC, TGGGGT, and
GGGGTA, and the second cluster includes TACCCC, ACCCCG, CCCCGC, and CCCCAC. Among
them, GTGGGG and CCCCAC, and GGGGTA and TACCCC, are reverse complements. The
two clusters both correspond to the binding sites of the transcription factor Mig1p (Zn finger),
but seen from different strands. This may imply that the binding sites of Mig1p are active in
both orientations. This property, however, was not found for the binding sites of Gcn4p (see
Table 3). For example, when the cutoff of the significance index is reduced to 1.3 (46 words
then satisfy the criterion), the cluster of TGACTC and GACTCA expands to include another 4
members: CGATGA, GATGAC, ATGACT, and GTGACT; while only one of their reverse complements,
GAGTCA, also has 3 indexes larger than 1.3. Among the 46 words, it can only be clustered with
one other word, GGAGTC. Thus, the binding sites of Gcn4p seem to be active preferentially in
one direction.
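
The reverse-complement relations quoted above, and the count of 2080 distinct 6-tuples when
both strands are pooled, can be checked with a few lines of Python. This is only an
illustrative sketch with hypothetical function names.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """The same site read from the opposite strand."""
    return word.translate(COMPLEMENT)[::-1]

def canonical(word):
    """Single representative of a word and its reverse complement."""
    return min(word, reverse_complement(word))

hexamers = ["".join(letters) for letters in product("ACGT", repeat=6)]
n_single_strand = len(hexamers)                        # all 6-tuples on one strand
n_both_strands = len({canonical(w) for w in hexamers}) # word and reverse complement merged

mig1_pairs = [(w, reverse_complement(w)) for w in ("GTGGGG", "GGGGTA")]
gcn4_partner = reverse_complement("TGACTC")
```

The 64 hexamers that equal their own reverse complement are counted once, which is why pooling
both strands gives (4096 + 64) / 2 = 2080 rather than 2048.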

Among the available methods of binding site prediction, ours is similar to that of Ref. [?] in
that both work by defining expected frequencies of words. The difference is that our method
defines the expected frequencies from the statistical properties of the sequences themselves,
while Ref. [?] more or less heuristically takes the word frequencies of all non-coding
sequences as the expected values. Our method is thus expected to be more precise and to give
less biased results.

An alternative method developed by Li et al. [10] gives more subtle consideration to the
statistical features of DNA sequences. In their model, the sequence is considered as a text
without interword delimiters. They apply a maximum likelihood approach to recover the words,
which they consider as possible binding site candidates. But the computation needed to obtain
meaningful results is far more complex.

More methods to detect unknown elements within functionally related sequences are available
(for a review, see [11]), most of which, such as the consensus [12] and the Gibbs sampler [13],
are based upon well defined biological models. The types of signal that can be detected are
generally limited; it is difficult for them to detect multiple signals. But these methods are
able to detect much larger patterns with high precision. The present method can be used to
detect multiple elements, but the patterns it can find are short.

Comparing the noncoding and coding regions of DNA sequences is also a widely explored problem
in biology [14, 15, 16]. The MEP treatment gives a systematic way to study the statistical
differences between coding and noncoding regions. Table 1 shows that the significance indexes
in the CDS regions are distributed much more broadly than those in the promoter regions. The
contrast holds for all the word lengths we studied (up to 8 bases). This reveals that the CDS
regions are in a more nonrandom state. Two factors may help to interpret this phenomenon. First,
the mutation rate of CDS regions is much lower than that of the promoter regions [?]. Second,
the code usage in CDS regions is universal and definite, while in the promoter regions the
lengths of regulatory elements differ from each other and the regulatory elements may differ
strongly from the consensus sequences.

Figure 1: The ratio of the average hits ($H$) of words above a certain cutoff of the
significance index to the average hits ($H_0$) of all the words of the same length. The values
of $H_0$ (word length) are 405 (2), 94.4 (3), 21.7 (4), 4.92 (5), 1.10 (6), 0.241 (7),
0.0528 (8).

Figure 2: The dependence of the average hits of 6-tuples on their average significance index.
The data in this figure give a more visual illustration of the 6-tuple data in Table 1.

Table 1: The dependence of average hits on the significance index $I_w$. The values
shown in the hits column are averaged over the hits of the points (words) included in the
significance index range shown in the $I_w$ column.

Table 2: The coregulated gene families GCN and TUP, and the criteria by which they are
clustered.

Family: GCN
Genes: ARG1, ARG3, ARG4, ARG8, ARO3, ARO4, ARO7, CPA1, CPA2, GLN1, HIS1, HIS2,
       HIS3, HIS4, HIS5, HOM2, HOM3, HOM6, ILV1, ILV2, ILV5, LEU1, LEU2, LEU3,
       LEU4, LYS1, LYS2, LYS5, LYS9, MES1, MET14, MET3, MET6, TRP2, TRP3,
       TRP4, TRP5, THR1
Shared regulatory property: General amino acid control; activated by Gcn4p.
References: Hinnebusch [17]

Family: TUP
Genes: FSP2, YNR073C, YOL157C, HXT15, SUC2, YNR071C, YDR533C, YEL070W, RNR2,
       YER067W, CWP1, YGR243W, YDR043C, YER096W, HXT6, YLR327C, YJL171C,
       YGR138C, HXT4, GSY1, YOR389W, MAL31, YML131W, RCK1
Shared regulatory property: All genes which are both derepressed by a factor larger than 4
       when TUP1 is deleted, and induced by a factor larger than … during the diauxic shift.
References: DeRisi et al. [18]

Table 3: Highly overrepresented words in the promoter regions of the GCN and TUP families.
For each family, the 6-tuples for which at least 3 of the 5 significance indexes are larger
than 3 are listed. Words that also appear in Table 2 of Ref. [?] as significant patterns are
highlighted in bold. Words are clustered according to their similarity. sig($i$) is the value
of $P_{w^6}/P^0_{w^6}$, with $P^0_{w^6}$ being the frequency of the 6-tuple $w^6$ predicted
from the frequencies of $i$-tuples.

Family  Word          sig(1)  sig(2)  sig(3)  sig(4)  sig(5)
GCN     …        7    3.71    3.88    3.15    2.10    1.84    -
        …        10   3.75    3.15    3.22    1.94    1.55

        GTGCCA   11   3.76    3.35    3.06    2.23    1.37

        GGTGGT   10   3.26    3.73    3.14    2.19    1.53    -

TUP     GTGGGG   9    6.67    5.23    3.76    3.27    1.47    KANWWWWATSYGGGGW Mig1p (Zn finger)
        AGGGGC   10   6.77    3.90    3.65    2.64    1.64
        ACGGGC   7    4.49    3.57    3.23    2.62    1.97
        TGGGGT   9    4.10    3.21    3.14    3.39    1.37
        GGGGTA   10   4.39    4.29    3.52    2.89    1.58

        TACCCC   1(5  5.67    5.73    4.22    2.52    1.32    Complement of Mig1p
        ACCCCG   11   6.34    5.24    5.15    3.27    1.39    KANWWWWATSYGGGGW (Zn finger)
        CCCCGC   8    7.37    4.68    3.60    2.31    1.33
        CCCCAC   12   6.55    5.05    3.49    2.27    1.36

        AGGAGG   11   4.66    3.79    3.12    1.70    1.44    -

        GGTGGT   9    4.10    4.27    3.41    2.21    1.31

        CTCGAG   S    3.15    4.00    4.42    2.22    1.17    -
        TCGAGG   9    3.75    3.88    4.38    2.15    1.73

        GCGGAG   7    4.74    4.07    3.20    1.84    1.35    -
        CGGAGA   10   4.02    4.17    3.05    1.97    1.69

        CTGCTA   10   2.42    3.23    4.28    3.21    1.90    -
        GTGCCT   17   6.95    6.81    4.86    3.34    1.71
        TGCCAC   10   3.74    3.38    3.02    1.74    1.51

        GCGCCG   1    4.10    3.12    3.13    3.23    2.67    -
        GCAACG   ■'   3.43    2.88    3.12    3.13    1.37
        GCACGG   S    5.13    4.66    3.12    2.58    1.66

        CAGTGG   8    3.33    3.48    3.01    1.90    1.61    -

        CGCGAT   7    2.76    3.48    4.12    3.68    2.083   -